This document explains the methodology of the correlations analysis for San Jose social distancing compliance and summarizes key results. It uses data on social distancing through 5/4/2020.
library(tidyverse)
library(plotly)
library(sf)
library(mapview)
library(tigris)
library(censusapi)
library(leaflet)
library(lehdr)
library(usmap)
options(
tigris_class = "sf",
tigris_use_cache = TRUE
)
The data used for social distancing compliance comes from SafeGraph’s social distancing dataset. In this analysis, we specifically used the data on devices “completely at home,” which SafeGraph defines as devices that did not leave their usual nighttime location (see documentation at https://docs.safegraph.com/docs/social-distancing-metrics). For each census block group in San Jose, we calculated the average percent of devices completely at home on weekdays since the start of the Bay Area shelter-in-place order (3/16/2020), as well as the average percent of devices completely at home on weekdays during January and February 2020, prior to the shelter-in-place order and widespread COVID-19 concerns. From these results, we obtained the percent of devices leaving home during each time period, that is, 100 minus the percent completely at home.
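The calculation described above can be sketched on toy data. This is a minimal illustration, not the pipeline used for this report: the data frame `sd_toy`, its values, and the column names (`device_count` and `completely_home_device_count`, modeled on SafeGraph-style fields) are assumptions made for the example, and it uses base R only.

```r
# Toy daily records for two hypothetical block groups (all values are made up).
sd_toy <- data.frame(
  blockgroup = rep(c("A", "B"), each = 4),
  date = rep(as.Date(c("2020-02-03", "2020-02-08", "2020-03-17", "2020-03-21")), times = 2),
  device_count = c(100, 100, 100, 100, 200, 200, 200, 200),
  completely_home_device_count = c(20, 40, 50, 60, 30, 80, 60, 100)
)

# Keep weekdays only; format "%u" gives 1 (Monday) through 7 (Sunday).
sd_toy <- sd_toy[as.integer(format(sd_toy$date, "%u")) <= 5, ]

# Label each day relative to the 3/16/2020 shelter-in-place order.
# (The toy data has only February dates before the order.)
sd_toy$period <- ifelse(sd_toy$date >= as.Date("2020-03-16"), "post", "pre")

# Daily percent of devices NOT completely at home.
sd_toy$pct_leaving <- 100 * (1 - sd_toy$completely_home_device_count / sd_toy$device_count)

# Average the daily percentages within each block group and period.
pct_leaving_avg <- aggregate(pct_leaving ~ blockgroup + period, data = sd_toy, FUN = mean)
print(pct_leaving_avg)
```

Averaging daily percentages, as done here, weights every weekday equally; pooling device counts across days before dividing would be a different (also defensible) choice.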
In our analysis, we examined the correlations between the percent of devices leaving home before and after the shelter-in-place order was instated and various demographic variables, including income, age, language ability, race, ethnicity, education level, vehicle ownership, occupants per room in a household, sex of workers, and high speed internet access. Information on the demographic variables at the census block group level was obtained from the American Community Survey 2018 data. We also assessed the correlations between these demographic variables and the change in percent of devices staying completely at home after the shelter-in-place order relative to before the order. This latter metric should indicate the ability of a community to alter its behavior to comply with the shelter-in-place order.
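As a rough sketch of the change metric and a single correlation test, the example below uses simulated block groups. The variable names (`pct_high_income`, `pct_home_pre`, `pct_home_post`, `change_home`) and the simulated relationship are assumptions made for illustration; they are not the report's data.

```r
set.seed(1)
n <- 200  # number of simulated block groups

toy <- data.frame(pct_high_income = runif(n, 0, 100))

# Simulated percent of devices completely at home before and after the order.
toy$pct_home_pre  <- 20 + rnorm(n, sd = 3)
toy$pct_home_post <- 35 + 0.3 * toy$pct_high_income + rnorm(n, sd = 5)

# Change metric: post-order percent minus pre-order percent (percentage points).
toy$change_home <- toy$pct_home_post - toy$pct_home_pre

# Pearson correlation between the change metric and the demographic variable.
ct <- cor.test(toy$change_home, toy$pct_high_income)
print(ct$estimate)
print(ct$p.value)
```

Repeating the `cor.test()` call for each demographic variable gives one correlation and p value per variable, which is the kind of screening the paragraph above describes.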
Here we present a summary of the key significant results from our correlations analysis.
Income was a strong predictor of the percent of devices leaving home during the shelter-in-place order period. We considered different income thresholds and concluded that the percent of households earning over $125,000 annually was the best predictor. In the graph below, the percent of devices leaving home is plotted against the percent of households making over $125,000, with each block group represented as a point on the graph. The best-fit linear trendline is shown in orange. The slider at the bottom switches the plot between the period before the shelter-in-place order and the period after it.
# load data
sj_dem_distancing_pre_post <- readRDS("/Users/simonespeizer/Documents/2020 Spring Quarter/CEE 218Z/covid19/Simone_Speizer/sj_socialdistancing_demdata_prepostdifs_manyvars.rds")
# combine the data so that plots can be animated with trendlines
# get the before shelter in place data
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_post %>% dplyr::select(`% not completely at home pre shelter`, blockgroup)
sj_dem_distancing_pre_shelter[is.na(sj_dem_distancing_pre_shelter)] <- 0
# relabel column
colnames(sj_dem_distancing_pre_shelter)[1] <- "% leaving home"
# add back demographic variables
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_shelter %>% left_join(sj_dem_distancing_pre_post)
# get trendlines
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_shelter %>%
mutate(
income_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`% over 125,000`)),
hispanic_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`% non hispanic/latino`)),
educ_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`percent associates or higher`))) %>%
cbind(`Before or After Shelter-in-Place` = "Before shelter-in-place")
# repeat for post shelter in place
sj_dem_distancing_post_shelter <- sj_dem_distancing_pre_post %>% dplyr::select(`% not completely at home`, blockgroup)
sj_dem_distancing_post_shelter[is.na(sj_dem_distancing_post_shelter)] <- 0
# relabel column
colnames(sj_dem_distancing_post_shelter)[1] <- "% leaving home"
# add back demographic variables
sj_dem_distancing_post_shelter <- sj_dem_distancing_post_shelter %>% left_join(sj_dem_distancing_pre_post)
# get trendlines
sj_dem_distancing_post_shelter <- sj_dem_distancing_post_shelter %>%
mutate(
income_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`% over 125,000`)),
hispanic_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`% non hispanic/latino`)),
educ_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`percent associates or higher`))) %>%
cbind(`Before or After Shelter-in-Place` = "After shelter-in-place")
# combine them
sj_dem_distancing_pre_post_separate <- rbind(sj_dem_distancing_pre_shelter, sj_dem_distancing_post_shelter)
# convert the before/after column to factor so it shows up correctly on the plots
sj_dem_distancing_pre_post_separate$`Before or After Shelter-in-Place` <- factor(sj_dem_distancing_pre_post_separate$`Before or After Shelter-in-Place`, levels = c("Before shelter-in-place", "After shelter-in-place"))
fig_income <-
plot_ly (sj_dem_distancing_pre_post_separate) %>%
add_trace(
x = ~`% over 125,000`,
y = ~`% leaving home`,
frame = ~`Before or After Shelter-in-Place`,
type = 'scatter',
mode = 'markers',
showlegend = F
) %>%
add_trace(
x = ~`% over 125,000`,
y = ~income_trendline,
type = 'scatter',
mode = 'lines',
line = list(width = 5, color = 'rgba(255, 165, 0, 1)'), # plotly lines take `width`, not `size`
frame = ~`Before or After Shelter-in-Place`,
showlegend = F
) %>%
animation_button(visible = F) %>%
animation_slider(
pad = list(t =75),
currentvalue = list(visible=F)
) %>%
layout(xaxis = list(title = 'Percent of households making over $125,000'), yaxis = list(title = 'Percent of devices leaving home'), margin = list(l = 75,r = 75))
fig_income
From this figure, we see that during the shelter-in-place order period, a higher percentage of households making over $125,000 in a block group correlates with fewer devices leaving home in that block group. This trend is the opposite of that observed prior to the shelter-in-place order, suggesting that block groups with a greater share of higher-income households were more able to adjust their behavior to comply with the shelter-in-place order.
To better assess this relative change in behavior, we fit a linear model relating the change in devices staying completely at home after the shelter-in-place order (relative to before the order) to the percent of households earning more than $125,000. The results of that model, including the coefficient on income (the slope of the linear fit) and the R-squared value, are shown below.
Coefficient:
income_125_model_dif <- lm(`% increase in staying completely home` ~ `% over 125,000`, sj_dem_distancing_pre_post)
print(summary.lm(income_125_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)       13.2697    0.77542   17.11 3.084e-53
## `% over 125,000`   0.3097    0.01713   18.08 5.072e-58
R-squared:
print(summary.lm(income_125_model_dif)$r.squared, digits = 4)
## [1] 0.3656
From the coefficient value, we see that each 1 percentage point increase in the share of households with incomes over $125,000 is associated with an increase of about 0.3 percentage points in the difference between the percent of devices staying completely at home after the shelter-in-place order and the percent before it. The R-squared value assesses the degree to which this model predicts the variation in the change in devices staying completely at home observed in the data; the result of 0.37 indicates that the linear fit with income predicts about 37% of the observed variation. The very low p value indicates that this relationship is statistically significant. This is a relatively strong fit for a single predictor, even without examining the effect of other demographic variables.
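To make the coefficient concrete, the short calculation below plugs two example income shares into the fitted line, using the intercept and slope reported in the model output above. The two income shares (20% and 60%) are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the model output reported above.
intercept <- 13.2697
slope <- 0.3097

# Two hypothetical block groups: 20% and 60% of households earning over $125,000.
pct_over_125k <- c(20, 60)

# Predicted increase (in percentage points) in devices staying completely at home.
predicted_change <- intercept + slope * pct_over_125k
print(predicted_change)

# Gap between the two hypothetical block groups.
print(diff(predicted_change))
```

Under this fit, the two block groups would be expected to increase their stay-at-home share by roughly 19.5 and 31.9 percentage points respectively, a gap of about 12.4 percentage points.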
When we considered the Hispanic/Latino population of a block group and percent of devices leaving home, we observed that a greater percentage of non-Hispanic/Latino residents in a block group correlates with a lower percent of devices leaving home.
fig_hisp <-
plot_ly (sj_dem_distancing_pre_post_separate) %>%
add_trace(
x = ~`% non hispanic/latino`,
y = ~`% leaving home`,
frame = ~`Before or After Shelter-in-Place`,
type = 'scatter',
mode = 'markers',
showlegend = F
) %>%
add_trace(
x = ~`% non hispanic/latino`,
y = ~hispanic_trendline,
type = 'scatter',
mode = 'lines',
line = list(width = 5, color = 'rgba(255, 165, 0, 1)'), # plotly lines take `width`, not `size`
frame = ~`Before or After Shelter-in-Place`,
showlegend = F
) %>%
animation_button(visible = F) %>%
animation_slider(
pad = list(t =75),
currentvalue = list(visible=F)
) %>%
layout(xaxis = list(title = 'Percent of residents that are not Hispanic or Latino'), yaxis = list(title = 'Percent of devices leaving home'), margin = list(l = 75,r = 75))
fig_hisp
The results of the linear model fitting the change in percent of devices staying completely at home and the percent of residents that are not Hispanic/Latino are shown below.
Coefficient:
hispanic_model_dif <- lm(`% increase in staying completely home` ~ `% non hispanic/latino`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
##                         Estimate Std. Error t value  Pr(>|t|)
## (Intercept)              10.4942    1.10795   9.472 7.369e-20
## `% non hispanic/latino`   0.2277    0.01546  14.730 8.085e-42
R-squared:
print(summary.lm(hispanic_model_dif)$r.squared, digits = 4)
## [1] 0.2768
As the percent of individuals that are not Hispanic/Latino increases by 1 percentage point, the change in percent of devices staying completely at home increases by about 0.23 percentage points. This linear model with Hispanic/Latino population predicts about 28% of the observed variation in the change in devices staying completely at home, comparable with the linear fit for education that was previously shown. The results are again statistically significant.
However, we hypothesized that the correlation observed here might be related to underlying correlations between Hispanic/Latino population and other demographic variables. To test this, we first performed a multiple regression analysis with both income and Hispanic/Latino population, yielding the following results.
Coefficients:
hispanic_inc_model_dif <- lm(`% increase in staying completely home` ~ `% non hispanic/latino` + `% over 125,000`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_inc_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
##                         Estimate Std. Error t value  Pr(>|t|)
## (Intercept)               9.5146    1.01569   9.368 1.746e-19
## `% non hispanic/latino`   0.1018    0.01839   5.535 4.761e-08
## `% over 125,000`          0.2325    0.02175  10.688 2.068e-24
R-squared:
print(summary.lm(hispanic_inc_model_dif)$r.squared, digits = 4)
## [1] 0.3982
This result indicates that Hispanic/Latino population and income together predict about 40% of the variation in the change in percent of devices staying completely at home. This is an improvement of about 3 percentage points over the model with income alone, indicating that after controlling for income, Hispanic/Latino population still provides some, though more limited, predictive power. Both variables have significant p values.
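One common way to check whether a second predictor adds information beyond the first is a nested-model F-test alongside the R-squared gain. The sketch below runs on simulated data with hypothetical variable names (`income`, `non_hisp`, `change`); it illustrates the technique and is not a rerun of the models in this report.

```r
set.seed(42)
n <- 500  # number of simulated block groups

income <- runif(n, 0, 100)
# The second predictor is deliberately correlated with income and clipped to 0-100.
non_hisp <- pmin(100, pmax(0, 40 + 0.5 * income + rnorm(n, sd = 15)))
change <- 10 + 0.25 * income + 0.10 * non_hisp + rnorm(n, sd = 8)
toy <- data.frame(income, non_hisp, change)

# Smaller model (income only) nested inside the larger model (income + non_hisp).
m_income <- lm(change ~ income, data = toy)
m_both   <- lm(change ~ income + non_hisp, data = toy)

# Gain in R-squared from adding the second predictor.
r2_gain <- summary(m_both)$r.squared - summary(m_income)$r.squared
print(r2_gain)

# F-test of whether the larger model fits significantly better.
f_test <- anova(m_income, m_both)
print(f_test)
```

With a single added predictor, the F-test p value matches the p value of that predictor's t-test in the larger model, so the two views agree.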
Hypothesizing that education and other race or ethnicity variables might be additional underlying factors, we next considered a linear model that again includes income and Hispanic/Latino population, but also incorporates education and percent of residents that are Asian; the results are shown below.
Coefficients:
hispanic_inc_educ_asian_model_dif <- lm(`% increase in staying completely home` ~ `% non hispanic/latino` + `% over 125,000` + `percent associates or higher` + `% Asian`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_inc_educ_asian_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                     9.35312    1.00046  9.3488 2.056e-19
## `% non hispanic/latino`         0.01427    0.02702  0.5282 5.976e-01
## `% over 125,000`                0.21492    0.02337  9.1974 7.049e-19
## `percent associates or higher`  0.09492    0.03153  3.0105 2.725e-03
## `% Asian`                       0.07275    0.01636  4.4473 1.048e-05
R-squared:
print(summary.lm(hispanic_inc_educ_asian_model_dif)$r.squared, digits = 4)
## [1] 0.4237
When accounting for education, income, and Asian population, Hispanic/Latino population loses its predictive ability (its p value is no longer significant), though all three of the other variables are significant. Income appears to be the key variable, followed by education and Asian population. This combined model predicts about 42% of the variation in the change in percent of devices staying completely at home.
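The pattern described here, where a variable is significant on its own but not once correlated variables are included, can be reproduced on simulated data. In this sketch the names `x1`, `x2`, and `y` are hypothetical: `x1` stands in for a driver such as income, and `x2` is correlated with `x1` but has no direct effect on the outcome.

```r
set.seed(123)
n <- 400

x1 <- rnorm(n)                  # stands in for a driver variable such as income
x2 <- x1 + rnorm(n, sd = 0.5)   # correlated with x1, no direct effect on y
y  <- 2 * x1 + rnorm(n)         # y depends on x1 only
toy <- data.frame(x1, x2, y)

# p value for x2 when it is the only predictor.
p_alone <- summary(lm(y ~ x2, data = toy))$coefficients["x2", "Pr(>|t|)"]

# p value for x2 once x1 is also in the model.
p_joint <- summary(lm(y ~ x1 + x2, data = toy))$coefficients["x2", "Pr(>|t|)"]

print(cor(toy$x1, toy$x2))
print(c(alone = p_alone, joint = p_joint))
```

In this setup `x2` is expected to look highly significant on its own, while its p value in the joint model is typically much larger; the exact numbers depend on the random seed.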
As we have shown, income, education level, and Asian population together are very strong predictors of the change in percent of devices staying completely at home. These three variables, when combined with child and young adult population, yielded a model with the greatest predictive ability for the change in percent of devices staying completely at home. Though the two age variables were not strong predictors on their own, when included in a multivariable model they did provide additional predictive power beyond that of a model with only income, education level, and Asian population. The parameters of the best-predicting model are presented below.
Coefficients:
inc_educ_asian_child_yad_model_dif <- lm(`% increase in staying completely home` ~ `% over 125,000` + `percent associates or higher` + `% Asian` + `percent less than 18` + `percent 20-29`, sj_dem_distancing_pre_post)
print(summary.lm(inc_educ_asian_child_yad_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
##                                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                      6.6729    1.97741   3.375 7.902e-04
## `% over 125,000`                 0.1684    0.02326   7.242 1.464e-12
## `percent associates or higher`   0.1380    0.02280   6.051 2.622e-09
## `% Asian`                        0.0847    0.01432   5.913 5.834e-09
## `percent less than 18`           0.2290    0.05022   4.559 6.307e-06
## `percent 20-29`                 -0.1444    0.04323  -3.341 8.893e-04
R-squared:
print(summary.lm(inc_educ_asian_child_yad_model_dif)$r.squared, digits = 4)
## [1] 0.4731
All five of these variables are significant in this model. Higher income, higher educational attainment, a higher percent of residents that are Asian, and a higher percent of residents that are children are all associated with larger increases in percent of devices staying completely at home; a higher percent of residents that are ages 20-29 is associated with smaller increases in percent of devices staying completely at home following shelter-in-place. These five variables together predict about 47% of the variation in the change in percent of devices staying completely at home.
Note that income alone predicted about 37% of the variation in the change in percent of devices staying completely at home, while adding in education raised this to 40%. Including percent of residents that are Asian boosted the predictive power to 42%, and adding percent of residents that are younger than 18 and between the ages of 20-29 raised it to 47%.
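Because R-squared can never decrease when a predictor is added to a linear model, it can help to look at adjusted R-squared alongside it when comparing models of increasing size. The sketch below does this on simulated data; the variables `x1`, `x2`, and `noise_var` are hypothetical, with `noise_var` unrelated to the outcome by construction.

```r
set.seed(7)
n <- 300

toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise_var = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

# A sequence of nested models, each adding one predictor.
formulas <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + noise_var)
fits <- lapply(formulas, lm, data = toy)

# Collect R-squared and adjusted R-squared for each model.
comparison <- data.frame(
  model = sapply(formulas, function(f) paste(deparse(f), collapse = "")),
  r_squared = sapply(fits, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(fits, function(m) summary(m)$adj.r.squared)
)
print(comparison)
```

R-squared rises (or stays flat) at every step by construction, while adjusted R-squared penalizes each added predictor, so a variable that adds little, like `noise_var` here, tends to leave it roughly unchanged or slightly lower.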
Here we summarize the results for other demographic variables we considered that did have some correlation with changes in staying at home, but that were not found to be important in the highest-predicting multiple regression model. These variables include high speed internet access, occupants per room in a household, English language ability, and Spanish language ability.
The analysis of internet access, specifically the percent of households that have access to high speed internet, was inspired by the paper “Social Distancing, Internet Access and Inequality” by Chiou and Tucker (https://www.nber.org/papers/w26982), which found that the combination of high speed internet access and high income was the key driver of ability to stay at home. We did indeed find a correlation between the increase in percent of devices staying completely at home and the percent of households with broadband such as cable, fiber optic or DSL (coefficient 0.37, p value < 2e-16, R-squared 0.22). However, high speed internet access added little to a model that already incorporated income: including it in a regression with income raised the R-squared value by only about 0.009 relative to the R-squared of 0.37 for the model with income alone, and it did not provide additional useful information in the multivariable regression model.
Similarly, though the percent of households that have 1 or fewer occupants per room was also correlated with change in devices staying at home (coefficient 0.37, p value < 2e-16, R-squared 0.17), this metric also failed to provide additional predictive power over income alone (R-squared 0.373 for income and occupants per room combined).
Percent of residents speaking English well provided some, but less, predictive power on its own (coefficient 0.37, p value < 2e-16, R-squared 0.12), but again was not significant in multiple regression analyses that incorporated other demographic variables.
Percent of residents speaking Spanish did offer some predictive power (percent of residents NOT speaking Spanish had a coefficient of 0.24, p value < 2e-16, R-squared 0.24) but is a very similar metric to the Hispanic/Latino population variable, and was similarly insignificant when combined with education and income.
Demographic variables we considered that lacked strong correlations with changes in staying at home include percent of residents ages 65 and older (R-squared 0.04), percent of residents that are white (R-squared 0.005), percent of households with a vehicle available (R-squared 0.08), and percent of workers that are male (R-squared 0.0003).
Our full analyses can be viewed at https://stanfordfuturebay.github.io/simone_sd_correlations_analysis_sj_01.html.